SMART High Precision: TREC 7
Authors
Abstract
The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in TREC 7, concentrating on high precision retrieval. In particular, we present an in-depth analysis of our High-Precision Track results, including offering evaluation approaches and measures for time dependent evaluation. We participated in the Query Track, making initial efforts at analyzing query variability, one of the major obstacles for improving retrieval effectiveness.

In the Smart system, the vector-processing model of retrieval is used to transform both the available information requests as well as the stored documents into vectors of the form $D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$, where $D_i$ represents a document (or query) text and $w_{ik}$ is the weight of term $T_k$ in document $D_i$. A weight of zero is used for terms that are absent from a particular document, and positive weights characterize terms actually assigned. The assumption is that $t$ terms in all are available for the representation of the information.

The basic "tf*idf" weighting schemes used within SMART have been discussed many times. For TREC 7 we use the same basic weights and document length normalization as were developed at Cornell by Amit Singhal for TREC 4 [3, 5]. Tests on various collections show that this indexing is reasonably collection independent and thus should be valid across a wide range of new collections. No human expertise in the subject matter is required for either the initial collection creation, or the actual query formulation.

The same phrase strategy (and phrases) used in all previous TRECs (for example [2, 3, 4, 1]) are used for TREC 7. Any pair of adjacent non-stopwords is regarded as a potential phrase. The final list of phrases is composed of those pairs of words occurring in 25 or more documents of the initial TREC 1 document set. Phrases are weighted with the same scheme as single terms.
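The phrase selection rule described above (adjacent non-stopword pairs, kept when they occur in 25 or more documents) can be illustrated with a minimal Python sketch. This is not SMART's implementation: the tiny stopword list, the whitespace tokenizer, the function names, and the reading of "adjacent" as adjacent in the original token stream are all assumptions made for illustration, and stemming is omitted.

```python
from collections import Counter

# Hypothetical toy stopword list; the real SMART list is much longer.
STOPWORDS = {"the", "of", "a", "in", "and", "to", "is"}


def adjacent_pairs(tokens, stopwords=STOPWORDS):
    """Yield pairs of neighbouring tokens where neither is a stopword."""
    for left, right in zip(tokens, tokens[1:]):
        if left not in stopwords and right not in stopwords:
            yield (left, right)


def select_phrases(docs, min_docs=25, stopwords=STOPWORDS):
    """Return the pairs that occur in at least min_docs documents."""
    doc_freq = Counter()
    for text in docs:
        # Assumed tokenizer: lowercase and split on whitespace.
        tokens = text.lower().split()
        # Use a set so each document counts at most once per pair.
        doc_freq.update(set(adjacent_pairs(tokens, stopwords)))
    return {pair for pair, df in doc_freq.items() if df >= min_docs}
```

Counting document frequency rather than raw occurrences matches the wording "occurring in 25 or more documents"; a pair repeated many times inside one document still counts once.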
When the text of document $D_i$ is represented by a vector of the form $(d_{i1}, d_{i2}, \ldots, d_{it})$ and query $Q_j$ by the vector $(q_{j1}, q_{j2}, \ldots, q_{jt})$, a similarity (S) computation between the two items can conveniently be obtained as the inner product between corresponding weighted term vectors as follows: S(D_i, Q …
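The formula itself is cut off in this abstract, so the following Python sketch relies only on the prose description: an inner product between weighted term vectors. It assumes the vectors are stored sparsely as dictionaries mapping terms to weights, with absent terms carrying weight zero as stated earlier. The function name and the example weights are hypothetical.

```python
def inner_product(doc_weights, query_weights):
    """Sum the products of weights over terms present in both vectors.

    Terms missing from either dictionary have weight zero and therefore
    contribute nothing to the sum.
    """
    # Iterate over the smaller vector to limit dictionary lookups.
    if len(query_weights) > len(doc_weights):
        doc_weights, query_weights = query_weights, doc_weights
    return sum(
        weight * doc_weights[term]
        for term, weight in query_weights.items()
        if term in doc_weights
    )


# Illustrative weights only, not values from any TREC run.
doc = {"retrieval": 0.5, "precision": 0.2, "text": 0.1}
query = {"retrieval": 1.0, "precision": 2.0, "smart": 3.0}
score = inner_product(doc, query)
```

Only "retrieval" and "precision" appear in both vectors here, so only those two terms contribute to the score.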
Similar resources
SMART in TREC 8
This year was a light year for the Smart Information Retrieval Project at SabIR Research and Cornell. We officially participated in only the Ad-hoc Task and the Query Track. In the Ad-hoc Task, we made minor modifications to our document weighting schemes to emphasize high-precision searches on shorter queries. This proved only mildly successful; the top relevant document was retrieved higher, but...
The Effect of Adding Relevance Information in
The effects of adding information from relevant documents are examined in the TREC routing environment. A modified Rocchio relevance feedback approach is used, with a varying number of relevant documents retrieved by an initial SMART search, and a varying number of terms from those relevant documents used to expand the initial query. Recall-precision evaluation reveals that as the amount of expan...
Deriving Very Short Queries for High Precision and Recall (MultiText Experiments for TREC-7)
The main aim of the MultiText experiments for TREC-7 was to derive very short queries that would yield high precision and recall, using a hybrid of manual and automatic processes. Identical queries were formulated for adhoc and VLC runs. A query set derived automatically from the topic title words, with an average of 2.84 terms per query, achieved a reasonable but unexceptional average precisio...
Queries for High Precision and Recall (MultiText Experiments for TREC-7)
The main aim of the MultiText experiments for TREC-7 was to derive very short queries that would yield high precision and recall, using a hybrid of manual and automatic processes. Identical queries were formulated for adhoc and VLC runs. A query set derived automatically from the topic title words, with an average of 2.84 terms per query, achieved a reasonable but unexceptional average precisio...
TREC-7 Ad-Hoc, High Precision and Filtering Experiments using PIRCS
In TREC-7, we participated in the main task of automatic ad-hoc retrieval as well as the high precision and filtering tracks. For ad-hoc, three experiments were done with query types of short (title section of a topic), medium (description section) and long (all sections) lengths. We used a sequence of five methods to handle the short and medium length queries. For long queries we employed a re...